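Decision Tree Induction

Attribute Selection Measures

[Figure: decision tree for buys_computer. Root node "Age?" with branches youth, middle-aged and senior; internal nodes "Student?" (branches yes/no) and "credit_rating?" (branches fair/excellent); leaves labeled Yes or No.]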
Tree Pruning

* Data may be overfitted to dataset anomalies and outliers
** Pruning removes the least reliable branches
** DT becomes less complex

* Prepruning -> statistically assess the goodness of a split before it takes place
** hard to choose thresholds for statistical significance
* Postpruning -> remove sub-trees from already constructed trees

1. remove sub-tree branches and replace with leaf node

2. leaf is labeled with most frequent class in sub-tree



A rule read from the decision tree, for example:

IF age = senior AND credit_rating = fair THEN buys_computer = no
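Rule Extraction from a Decision Tree

* Rules represent information and knowledge

** IF you study well THEN you'll succeed

** IF you're a student AND you have 5000LE THEN you most probably will buy an iPad (confidence?)

* How to assess the goodness of a rule?

$$\text{coverage}(R) = \frac{n_{covers}}{|D|}$$

$$\text{accuracy}(R) = \frac{n_{correct}}{n_{covers}}$$

Here $n_{covers}$ is the number of tuples covered by R, $n_{correct}$ is the number of covered tuples that R classifies correctly, and $|D|$ is the number of tuples in the dataset.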

Rule Extraction from a Decision Tree: What are Rules?

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

coverage(R1) = 2/14 = 14.28%

accuracy(R1) = 2/2 = 100%

X: (age = youth, income = medium, student = yes, credit_rating = fair)

Data Mining 2013 - Classification, 5/5/2013
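The coverage and accuracy of R1 can be checked mechanically. The following is a minimal sketch, assuming the 14 tuples are stored as plain Python tuples; the names `DATA`, `rule_r1` and `coverage_and_accuracy` are illustrative and do not come from the slides.

```python
# The 14 training tuples from the table above:
# (age, income, student, credit_rating, buys_computer)
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def rule_r1(row):
    """Antecedent of R1: (age = youth) AND (student = yes)."""
    age, _income, student, _credit, _label = row
    return age == "youth" and student == "yes"


def coverage_and_accuracy(data, antecedent, predicted_class):
    """Return (n_covers / |D|, n_correct / n_covers) for one rule."""
    covered = [row for row in data if antecedent(row)]
    n_covers = len(covered)
    n_correct = sum(1 for row in covered if row[-1] == predicted_class)
    coverage = n_covers / len(data)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy


COVERAGE, ACCURACY = coverage_and_accuracy(DATA, rule_r1, "yes")
print(f"coverage(R1) = {COVERAGE:.4f}, accuracy(R1) = {ACCURACY:.4f}")
```

R1 covers RIDs 9 and 11, and both have buys_computer = yes, which gives the 2/14 and 2/2 figures shown on the slide.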



Note: the rules extracted from a decision tree are mutually exclusive. Each record follows exactly one path from the root to a leaf, so it is classified by exactly one rule.
Rule Extraction from a Decision Tree

** Create one rule for each path from root to leaf in the decision tree

1. Each splitting criterion is ANDed to form rule antecedent (IF)

2. Leaf node holds class prediction (THEN)

Can the rules resulting from decision trees have conflicts?

R1: IF age = youth AND student = no THEN buys_computer = no
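The path-to-rule procedure can be sketched as a short recursive walk. The tree below is an assumption: it is an illustrative structure chosen to be consistent with the two example rules quoted in these notes, and the nested-tuple encoding and the names `TREE`, `extract_rules` and `format_rule` are hypothetical.

```python
# Assumed tree: an internal node is (attribute, {branch_value: child}); a leaf is a class label.
TREE = ("age", {
    "youth": ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior": ("credit_rating", {"fair": "no", "excellent": "yes"}),
})


def extract_rules(node, conditions=()):
    """Return one (antecedent, class) pair for each root-to-leaf path."""
    if isinstance(node, str):
        # Leaf node: it holds the class prediction (the THEN part).
        return [(list(conditions), node)]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        # Each splitting criterion is ANDed onto the antecedent (the IF part).
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules


def format_rule(antecedent, label, target="buys_computer"):
    """Render a rule in the IF ... AND ... THEN ... style used in the notes."""
    body = " AND ".join(f"{attr} = {value}" for attr, value in antecedent)
    return f"IF {body} THEN {target} = {label}"


RULES = extract_rules(TREE)
for antecedent, label in RULES:
    print(format_rule(antecedent, label))
```

Because every tuple follows exactly one path through the tree, exactly one of the extracted rules fires for any tuple.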
Rule Extraction from a DT: Resolving Rule Conflicts

Rule conflicts are the result of a tuple firing more than one rule with different class predictions.

Two resolution strategies:

* Size Ordering -> rule with largest antecedent (toughest) has highest priority -> fires and returns class prediction

* Rule Ordering -> rules prioritized a priori according to

** Class-based ordering -> decreasing importance (most frequent classes are highest, i.e. order of prevalence)
** Rule-based ordering -> measures of rule quality (e.g. accuracy, size, domain expertise)

Fallback (default) rule when no rules are triggered
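Size ordering with a fallback rule can be sketched as follows. The rule set and the default class here are hypothetical; they were made up to produce a conflict and are not taken from the slides.

```python
# Hypothetical rule set: (antecedent as attribute -> value, predicted class).
RULES = [
    ({"age": "youth"}, "no"),
    ({"age": "youth", "student": "yes"}, "yes"),
    ({"age": "senior", "credit_rating": "fair"}, "no"),
]
DEFAULT_CLASS = "yes"  # assumed fallback (default) rule


def fires(antecedent, record):
    """A rule fires when every condition in its antecedent holds for the record."""
    return all(record.get(attr) == value for attr, value in antecedent.items())


def classify_size_ordering(record, rules=RULES, default=DEFAULT_CLASS):
    """Resolve conflicts by size ordering: the rule with the largest antecedent wins."""
    triggered = [(antecedent, cls) for antecedent, cls in rules if fires(antecedent, record)]
    if not triggered:
        return default  # no rule was triggered, so use the fallback rule
    # max() returns the first maximal element, so ties go to the earlier rule.
    _antecedent, cls = max(triggered, key=lambda rule: len(rule[0]))
    return cls


print(classify_size_ordering({"age": "youth", "student": "yes"}))
print(classify_size_ordering({"age": "middle_aged", "student": "no"}))
```

For the first record both youth rules fire with different predictions, and the two-condition rule takes priority. Rule ordering would instead sort `RULES` once, by class prevalence or by a quality measure, and return the first rule that fires.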



Agenda

Classification

* Bayes' Theorem

* Naive Bayesian Classification

* K-Nearest Neighbor Classifiers

* Metrics for Evaluating Classifiers Performance

* Holdout, Random Subsampling and Cross-Validation
Naive Bayesian is called that way because it assumes class conditional 
independence, which means that the effect of an attribute value on a 
class is independent of the values of other attributes. 

Bayes Classification Methods

* Naive Bayesian classifier -> statistical classifier that predicts the probability that a tuple belongs to a specific class
** Based on Bayes' Theorem -> Bayes was an 18th century clergyman who worked on probability
** High accuracy
** Speed
** Class-conditional independence -> attributes' effect on class determination is independent
Bayes' Theorem

* X is a tuple representing "evidence"; H is the hypothesis "X ∈ C"
* Goal: determine posteriori probability P(H|X) -> probability that H holds given that we "observed" X
** i.e. probability that X ∈ C given that we know the attribute description of X
** P(X|H) -> probability that X has specific attribute values given that we know its class
** Posteriori probability is based on more information (conditional)

* P(H) is priori probability of H -> probability that any tuple belongs to a class, independent of its attribute values

** P(X) is probability that X has specific attribute values

* Bayes' Theorem:

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$



Naive Bayesian Classification

* To reduce computation of P(X|C_i), attributes are assumed to be independent -> hence the "naive" in the name

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$

** If attribute is categorical: P(x_k|C_i) = (number of tuples of class C_i in D having value x_k) / |C_{i,D}|
** If attribute is numerical: assume Gaussian distribution

$$P(x_k|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \, e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}$$

* Evaluate for each C_i and assign the class label of the class with max P(X|C_i) P(C_i)




Naive Bayesian Classification

* Given tuples with n attributes and m classes, Naive Bayes predicts that X belongs to the class with the highest posteriori probability

* C_i is called the maximum posteriori hypothesis

* Since P(X) is constant, maximize only the numerator

* If P(C_i) is unknown for all i, assume uniform probability

** Then you only have to maximize P(X|C_i)

* Otherwise, P(C_i) = |C_{i,D}| / |D|
Naive Bayesian Classification: Example

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

C1 = Yes = 9

C2 = No = 5

X: (age = youth, income = medium, student = yes, credit_rating = fair)

1- compute P(Ci):
P(C1) = 9/14 = 0.643
P(C2) = 5/14 = 0.357

2- compute P(X|Ci):

P(X|C1) = P(age=youth|buys_computer=yes)
        × P(income=medium|buys_computer=yes)
        × P(student=yes|buys_computer=yes)
        × P(credit_rating=fair|buys_computer=yes)
        = 2/9 × 4/9 × 6/9 × 6/9 = 0.044

P(X|C2) = 3/5 × 2/5 × 1/5 × 2/5 = 0.019

Since P(X|C1) P(C1) = 0.044 × 0.643 = 0.028 exceeds P(X|C2) P(C2) = 0.019 × 0.357 = 0.007, X ∈ C1.
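The worked example can be reproduced with a few lines of counting. This is a minimal sketch for categorical attributes only, with no smoothing for zero counts (a simplification that is not discussed in the slides); the names `DATA`, `naive_bayes_scores`, `X`, `SCORES` and `PREDICTED` are illustrative.

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for RIDs 1..14
DATA = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def naive_bayes_scores(data, x):
    """Return {class: (P(Ci), P(X|Ci), P(X|Ci) * P(Ci))} for the attribute tuple x."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for cls, n_cls in class_counts.items():
        prior = n_cls / len(data)
        likelihood = 1.0
        for k, value in enumerate(x):
            # Class-conditional independence: multiply the per-attribute relative frequencies.
            matches = sum(1 for row in data if row[-1] == cls and row[k] == value)
            likelihood *= matches / n_cls
        scores[cls] = (prior, likelihood, prior * likelihood)
    return scores


X = ("youth", "medium", "yes", "fair")
SCORES = naive_bayes_scores(DATA, X)
PREDICTED = max(SCORES, key=lambda cls: SCORES[cls][2])
for cls, (prior, likelihood, score) in SCORES.items():
    print(f"{cls}: P(C)={prior:.3f}  P(X|C)={likelihood:.3f}  P(X|C)P(C)={score:.3f}")
print("predicted class:", PREDICTED)
```

P(X) is never computed because it is the same for both classes and does not change which class has the larger numerator.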
Lazy Learners

K-Nearest Neighbor Classifiers

* Delay classification until new test data is available

* Store training data meanwhile

* Use similarity measure to compute distance between the test data tuple and each of the training data tuples (Euclidean, Manhattan, ...)

** Remember to normalize if ranges vary between attributes

* k stands for the number of "closest" neighbors of a test data tuple according to measured distance
** Majority voting of their class labels is used to determine the class of the test tuple







K-Nearest Neighbor: Example

RID  Age  Loan($)  Default  Distance
1    25   40000    No       102000
2    35   60000    No       82000
3    45   80000    No       62000
4    20   20000    No       122000
5    35   120000   No       22000
6    52   18000    No       124000
7    23   95000    Yes      47000
8    40   62000    Yes      80000
9    60   100000   Yes      42000
10   48   220000   Yes      78000
11   33   150000   Yes      8000

Query tuple: Age = 48, Loan($) = 142000, Default = ?

$$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

k=1 -> NN is RID 11

* Default = YES

k=3 -> NNs are RIDs 11, 5, 9

* Default = YES
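The example can be replayed with a small k-nearest-neighbor routine. This is a sketch that uses the raw, unnormalized values exactly as the slide does; `TRAINING`, `QUERY` and `knn_classify` are illustrative names.

```python
import math
from collections import Counter

# (RID, age, loan, default) from the example table
TRAINING = [
    (1, 25, 40000, "No"), (2, 35, 60000, "No"), (3, 45, 80000, "No"),
    (4, 20, 20000, "No"), (5, 35, 120000, "No"), (6, 52, 18000, "No"),
    (7, 23, 95000, "Yes"), (8, 40, 62000, "Yes"), (9, 60, 100000, "Yes"),
    (10, 48, 220000, "Yes"), (11, 33, 150000, "Yes"),
]
QUERY = (48, 142000)  # (age, loan) of the tuple to classify


def knn_classify(training, query, k):
    """Return (RIDs of the k nearest neighbors, majority class among them)."""
    # Euclidean distance on the (age, loan) pair; math.dist needs Python 3.8+.
    ranked = sorted(training, key=lambda row: math.dist(row[1:3], query))
    neighbors = ranked[:k]
    votes = Counter(row[3] for row in neighbors)
    return [row[0] for row in neighbors], votes.most_common(1)[0][0]


for k in (1, 3):
    rids, label = knn_classify(TRAINING, QUERY, k)
    print(f"k={k}: neighbors {rids} -> Default = {label}")
```

Without normalization the loan amount dominates the distance almost entirely, which is why the Distance column matches the loan difference. Min-max scaling both attributes before calling `knn_classify` would let age contribute as well.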



Model Evaluation

Metrics for Evaluating Classifier Performance

Measure                                   Formula
accuracy, recognition rate                (TP + TN) / (P + N)
error rate, misclassification rate        (FP + FN) / (P + N)
sensitivity, true positive rate, recall   TP / P
specificity, true negative rate           TN / N
precision                                 TP / (TP + FP)

Positives -> tuples representing class of interest
Negatives -> tuples representing other class(es)
True Positives -> positive tuples correctly labeled
False Positives -> negative tuples incorrectly labeled
True Negatives -> negative tuples correctly labeled
False Negatives -> positive tuples incorrectly labeled
Confusion Matrix

Actual \ Predicted   Yes   No    Total
Yes                  TP    FN    P
No                   FP    TN    N
Total                P'    N'    P + N



When the classes are balanced, accuracy is a suitable measure:

Accuracy = (TP + TN) / (P + N)

When the classes are imbalanced, sensitivity is more informative:

Sensitivity = TP / P



Example Buys_Computer Confusion Matrix

Actual \ Predicted   Yes    No     Total   Recognition (%)
Yes                  6954   46     7000    99.34
No                   412    2588   3000    86.27
Total                7366   2634   10000   95.42



Example Cancer Confusion Matrix

Actual \ Predicted   Yes   No     Total
Yes                  90    210    300
No                   140   9560   9700
Total                230   9770   10000
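The measures in the metrics table can be computed directly from the four confusion-matrix cells. The sketch below applies them to the two example matrices; `classifier_metrics`, `BUYS_COMPUTER` and `CANCER` are illustrative names.

```python
def classifier_metrics(tp, fn, fp, tn):
    """Compute the evaluation measures from the four cells of a confusion matrix."""
    p = tp + fn  # actual positives
    n = fp + tn  # actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "error_rate": (fp + fn) / (p + n),
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision": tp / (tp + fp),
    }


BUYS_COMPUTER = classifier_metrics(tp=6954, fn=46, fp=412, tn=2588)
CANCER = classifier_metrics(tp=90, fn=210, fp=140, tn=9560)

for name, metrics in (("buys_computer", BUYS_COMPUTER), ("cancer", CANCER)):
    summary = ", ".join(f"{measure}={value:.2%}" for measure, value in metrics.items())
    print(f"{name}: {summary}")
```

The cancer matrix shows why accuracy alone misleads on imbalanced classes: accuracy is 96.5%, yet sensitivity is only 30%, so most of the 300 actual positives are missed.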



Holdout and Random Subsampling

* Holdout -> RANDOMLY allocate 2/3 of data for training and remaining 1/3 for testing
* Random Subsampling -> repeat holdout k times and take average accuracy
Cross-validation

* k-fold cross-validation -> randomly partition dataset into k mutually exclusive folds of approximately equal size

* In iteration i, fold_i is the test set and all other folds are the training set

$$\text{Accuracy} = \frac{\sum \text{correct classifications for all } k \text{ iterations}}{\text{dataset size}}$$

* Stratified k-fold cross-validation -> class distribution in each fold is approximately the same as in the initial dataset

** Stratified 10-fold cross-validation is recommended
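Plain (non-stratified) k-fold cross-validation can be sketched with the standard library alone. The majority-class learner below is a hypothetical stand-in used only to keep the example self-contained; any `train_fn` / `predict_fn` pair with the same shape could be plugged in.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(rows, labels, k, train_fn, predict_fn, seed=0):
    """Sum the correct classifications over all k iterations and divide by the dataset size."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = train_fn([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


def train_majority(rows, labels):
    """Stand-in learner: remember the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    """Stand-in predictor: always return the remembered class."""
    return model


LABELS = ["yes"] * 9 + ["no"] * 5   # class counts of the buys_computer data
ROWS = list(range(len(LABELS)))      # placeholder rows; the stand-in learner ignores them
ACCURACY = cross_val_accuracy(ROWS, LABELS, 5, train_majority, predict_majority)
print(f"5-fold accuracy of the majority-class baseline: {ACCURACY:.3f}")
```

A stratified variant would split the indices of each class separately and deal them across the folds, so that every fold keeps roughly the 9:5 class ratio.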



Improving Classification Accuracy

Ensemble Methods

* Ensemble -> a set of classifiers, each with a vote for a class label
** Each base classifier is produced from a different partition of the dataset

** Majority voting is used to compose an aggregate classification
Ensemble Methods: Bagging

Algorithm: Bagging. The bagging algorithm: create an ensemble of classification models
for a learning scheme where each model gives an equally weighted prediction.

Input:

D, a set of d training tuples;
k, the number of models in the ensemble;
a classification learning scheme (decision tree algorithm, naive Bayesian, etc.).

Output: The ensemble, a composite model, M*.
Method:

(1) for i = 1 to k do // create k models:

(2) create bootstrap sample, D_i, by sampling D with replacement;

(3) use D_i and the learning scheme to derive a model, M_i;

(4) end for

To use the ensemble to classify a tuple, X:

let each of the k models classify X and return the majority vote;
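The pseudocode translates almost line for line into Python. This is a sketch under stated assumptions: the base learner is a hypothetical majority-class stand-in rather than a real decision tree or naive Bayes model, and the names `bagging_fit` and `bagging_predict` are illustrative.

```python
import random
from collections import Counter


def bagging_fit(rows, labels, k, train_fn, seed=0):
    """Steps (1)-(4): derive k models, each from a bootstrap sample of the d tuples."""
    rng = random.Random(seed)
    d = len(rows)
    models = []
    for _ in range(k):
        # Bootstrap sample D_i: d draws from D with replacement.
        sample = [rng.randrange(d) for _ in range(d)]
        models.append(train_fn([rows[i] for i in sample], [labels[i] for i in sample]))
    return models


def bagging_predict(models, predict_fn, x):
    """Let each model classify x and return the majority vote."""
    votes = Counter(predict_fn(model, x) for model in models)
    return votes.most_common(1)[0][0]


def train_majority(rows, labels):
    """Stand-in learning scheme: remember the majority class of the sample."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, x):
    """Stand-in predictor: always return the remembered class."""
    return model


LABELS = ["yes"] * 9 + ["no"] * 5
ROWS = list(range(len(LABELS)))
MODELS = bagging_fit(ROWS, LABELS, k=5, train_fn=train_majority)
PREDICTION = bagging_predict(MODELS, predict_majority, x=None)
print("models:", MODELS, "-> ensemble vote:", PREDICTION)
```

Because sampling is with replacement, some tuples appear several times in a bootstrap sample and others not at all, which is what makes the k models differ from one another.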
Ensemble Methods: Bagging

[Figure: the 14-tuple buys_computer training set (RID, age, income, student, credit_rating, Class: buys_computer) with bootstrap samples drawn from it by sampling with replacement. Sampled RIDs shown in the figure: 3, 14, 5, 3, 7, 4, 3, 7, 13, 13, 9, 12, 6, 10.]
1- Why is naive Bayesian classification called "naive"? Briefly outline the major ideas of naive
Bayesian classification. 

Answer: 

Naive Bayesian is called that way because it assumes class conditional independence, which 
means that the effect of an attribute value on a class is independent of the values of other 
attributes. 

2- Following is a hypothetical employment statistics record for recent graduates:

ID  major        avg. project score  avg. exam score  co-op?  employed?  salary
1   Computer     87                  75               Y       Y          60000
2   History      ?                   92               N       N          ?
3   Computer     77                  95               N       Y          50000
4   Engineering  97                  65               N       N
5   Engineering  84                  75               Y       Y          40000

What data preprocessing tasks are required for this data set? Briefly explain how you
will apply them.

Answer



Data cleaning: We will need to handle missing values for the graduate whose ID = 2. We
may choose to replace his average project score with the mean (86.25), although,
being a history major, this graduate may require a mean project score computed
from other history major graduates' data and not from the computer/engineering majors'
data present here. Salary, however, is a tricky thing. Intuition tells us that he should
have a salary, just like the graduate 4, who's also unemployed. Therefore, we are using
the most probable value.

Data integration: not needed.

Data reduction: We need to drop the ID attribute since it has no use in classification.
Also, we may want to drop the salary attribute since it is dependent on the employed
attribute (positive correlation). This will handle the missing value of the salary for ID 2
and will also solve the normalization issue that is highlighted in the data transformation
paragraph next.

Data transformation: We may want to discretize the scores using binning in order to
make all the attributes nominal. Alternatively, we may keep the data as is and just
normalize the numerical attributes so that the salary does not overwhelm the other
attributes when computing the distance.
b. Discuss which attribute would be picked for the first split by the decision tree
algorithm. Work out the Information Gain for the chosen attribute.

Answer

b. First of all, we need to decide which attribute would be the class attribute. Since this is
an employment dataset, the logical thing to do is choose "employed" to be the class
label. Otherwise, it is ok to choose any attribute for prediction.

We then would need to discuss the computations needed to get the information gain
(gain of the class label, gain of an attribute as the difference between the expected
information from class label and the expected information after the split based on that
attribute). Then we need to compute the information gain for all attributes in order to
decide which one will be the first to be used for the split in the decision tree.
Calculations are as follows:

$$info(employed) = -\sum_{i} p_i \log_2(p_i) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}$$

$$info_{major}(employed) = \sum_{j} \frac{|D_j|}{|D|} \times info(D_j), \quad j = \text{computer, history, engineering}$$

$$info_{major}(employed) = \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2}\right) + \frac{1}{5}\left(-\frac{1}{1}\log_2\frac{1}{1}\right) + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right)$$

$$Gain(major) = info(employed) - info_{major}(employed)$$

And we do these calculations for the rest of the attributes. Results will depend on how
you handled missing values and whether you did normalization/discretization or not.

Remember, you will work on the data after it is preprocessed and not before.
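The information gain for major can be evaluated numerically with a short script. This is a minimal sketch using only the major and employed columns, which have no missing values; `RECORDS`, `info` and `info_after_split` are illustrative names.

```python
from collections import Counter
from math import log2

# (major, employed?) for graduates 1..5
RECORDS = [
    ("Computer", "Y"),
    ("History", "N"),
    ("Computer", "Y"),
    ("Engineering", "N"),
    ("Engineering", "Y"),
]


def info(labels):
    """Expected information (entropy) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_after_split(records):
    """Weighted expected information after partitioning on the first field."""
    total = len(records)
    partitions = {}
    for value, label in records:
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / total * info(part) for part in partitions.values())


INFO_EMPLOYED = info([label for _major, label in RECORDS])
INFO_MAJOR = info_after_split(RECORDS)
GAIN_MAJOR = INFO_EMPLOYED - INFO_MAJOR
print(f"info(employed) = {INFO_EMPLOYED:.3f}")
print(f"info_major(employed) = {INFO_MAJOR:.3f}")
print(f"Gain(major) = {GAIN_MAJOR:.3f}")
```

The computer and history partitions are pure and contribute zero, so only the engineering partition (one employed, one not) adds to the expected information after the split.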
c. Could you use a k-Nearest Neighbor Classifier with this dataset? If so, give the 3
nearest neighbors and resulting class for the following data instance:

ID  major    avg. project score  avg. exam score  co-op?  employed?  salary
6   History  86                  74               Y       ?          27000

Note that you have attributes of mixed types. Use the following rules:

i. If attribute a is numeric: $d_{ij}^{(a)} = \frac{|x_{ia} - x_{ja}|}{\max_a - \min_a}$, where $\max_a$ and $\min_a$ are the maximum and minimum values in the attribute.

ii. If attribute a is nominal or binary: $d_{ij}^{(a)} = 0$ if $x_{ia} = x_{ja}$ and 1 otherwise.

Distance is calculated using the following equation:

$$d(i,j) = \frac{\sum_{a} \delta_{ij}^{(a)} d_{ij}^{(a)}}{\sum_{a} \delta_{ij}^{(a)}}$$

Where $\delta_{ij}^{(a)}$ is an indicator -> $\delta_{ij}^{(a)} = 0$ if either $x_{ia}$ or $x_{ja}$ is missing OR if $x_{ia} = x_{ja} = 0$
and attribute a is asymmetric binary. Otherwise $\delta_{ij}^{(a)} = 1$. (Breathe. Calculate the
individual distances per attribute first, then compute the indicators, then plug all into
the distance equation.)
Answer

c. Technically yes, we can use KNN to classify the record. Let's compute one distance,
and the others will be similar. Remember that you need to compute the distance to all
the records in order to choose the NN. Assuming we'll ignore the salary and won't
normalize nor discretize, calculations proceed as follows:

$$d(6,1) = \frac{\delta_{6,1}^{(major)} d_{6,1}^{(major)} + \delta_{6,1}^{(projscore)} d_{6,1}^{(projscore)} + \delta_{6,1}^{(examscore)} d_{6,1}^{(examscore)} + \delta_{6,1}^{(coop)} d_{6,1}^{(coop)}}{\delta_{6,1}^{(major)} + \delta_{6,1}^{(projscore)} + \delta_{6,1}^{(examscore)} + \delta_{6,1}^{(coop)}}$$

Assuming co-op is an asymmetric binary attribute, we proceed with the calculations:

$$d(6,1) = \frac{1 \times (\text{history} \neq \text{computer} \to 1) + 1 \times \frac{|86 - 87|}{97 - 77} + 1 \times \frac{|74 - 75|}{95 - 65} + 1 \times (\text{yes} = \text{yes} \to 0)}{4}$$

$$d(6,1) = \frac{1 \times 1 + 1 \times 0.05 + 1 \times 0.033 + 1 \times 0}{4} = 0.2708$$



Details:

$\delta_{6,1}^{(major)} d_{6,1}^{(major)}$ -> since major is categorical, and not binary, we don't need the asymmetric
rule. Since both values are not missing, we have $\delta_{6,1}^{(major)} = 1$. This indicator is a weight that
tells the equation to consider these values in the distance. If the major values of the two records
are equal, distance = 0, otherwise distance = 1. Since history and computer are a mismatch,
distance is 1, and the major indicator = 1 because there are no missing major values between 6
and 1.

$\delta_{6,1}^{(projscore)} d_{6,1}^{(projscore)}$ -> project score is numerical. Both values are present, so we have the
indicator = 1. We calculate the distance using the equation as indicated above, with the min and
max from the data itself.

$\delta_{6,1}^{(coop)} d_{6,1}^{(coop)}$ -> we assumed that co-op is asymmetric binary, so we check whether both values
are no. Since that is not the case, the indicator = 1. Since both values are yes, we have a match
and the distance = 0.

And so on for the rest of the attributes (ignoring salary as we have pointed out at the
beginning).
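The per-attribute distances and indicators can be combined in one small function. This is a sketch under the same assumptions as the answer (salary ignored, co-op treated as asymmetric binary with "N" as the 0 state); the attribute keys, `SCHEMA`, `RANGES` and `mixed_distance` are illustrative names.

```python
# Attribute types; salary is left out, as in the answer above.
SCHEMA = {
    "major": "nominal",
    "proj_score": "numeric",
    "exam_score": "numeric",
    "coop": "asymmetric_binary",
}
# (min, max) of each numeric attribute, taken from the data itself.
RANGES = {"proj_score": (77, 97), "exam_score": (65, 95)}


def mixed_distance(x, y, schema=SCHEMA, ranges=RANGES):
    """Indicator-weighted average of the per-attribute distances."""
    numerator = 0.0
    denominator = 0
    for attr, kind in schema.items():
        xa, ya = x.get(attr), y.get(attr)
        if xa is None or ya is None:
            continue  # indicator = 0: a value is missing
        if kind == "asymmetric_binary" and xa == ya == "N":
            continue  # indicator = 0: both values are in the 0 ("no") state
        if kind == "numeric":
            lo, hi = ranges[attr]
            d = abs(xa - ya) / (hi - lo)
        else:
            d = 0.0 if xa == ya else 1.0
        numerator += d  # indicator = 1
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes between the two records")
    return numerator / denominator


RECORD_6 = {"major": "History", "proj_score": 86, "exam_score": 74, "coop": "Y"}
RECORD_1 = {"major": "Computer", "proj_score": 87, "exam_score": 75, "coop": "Y"}
D_6_1 = mixed_distance(RECORD_6, RECORD_1)
print(f"d(6,1) = {D_6_1:.4f}")
```

A record with a missing value, such as graduate 2's project score, can be passed with `None` for that attribute. Its indicator then becomes 0 and the attribute drops out of both the numerator and the denominator.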